
Rewrote Tensor.cat to be shorter and (hopefully) clearer #372

Merged: 2 commits merged into tinygrad:master on Aug 30, 2022

Conversation

mitchellgoffpc (Contributor)

No description provided.

geohot merged commit 3af650b into tinygrad:master on Aug 30, 2022

geohot (Collaborator) commented Aug 30, 2022:

Nice! I think it's more readable too
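For context on the change under discussion: tinygrad's `cat` of this era builds the result by zero-padding each input into its slot of the output and summing, with the slot offsets taken from a cumulative sum of the sizes along the cat dimension; the PR's follow-up commit reuses the last cumsum entry as the total output size instead of computing a separate sum. A minimal NumPy sketch of that idea (not the actual PR diff; `cat_via_pad_and_sum` is a hypothetical name, and negative `dim` is not handled):

```python
import numpy as np
from itertools import accumulate

def cat_via_pad_and_sum(tensors, dim=0):
    sizes = [t.shape[dim] for t in tensors]
    offsets = [0, *accumulate(sizes)]   # offsets[-1] doubles as the total output size
    out_size = offsets[-1]
    padded = []
    for t, off in zip(tensors, offsets):
        pad = [(0, 0)] * t.ndim
        pad[dim] = (off, out_size - off - t.shape[dim])  # zeros before and after the slot
        padded.append(np.pad(t, pad))   # zero-pad each input into its final position
    return sum(padded)                  # elementwise sums reassemble the concatenation

a, b = np.ones((2, 3)), np.full((4, 3), 2.0)
assert np.array_equal(cat_via_pad_and_sum([a, b], dim=0), np.concatenate([a, b], axis=0))
```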

Johnmiicheal added a commit to Johnmiicheal/tinygrad that referenced this pull request Nov 12, 2022
* bugfixes

* get_movementroot

* add PAD movementop

* fix permute stacking

* some permutes are reshapes

* SLICE -> PAD,SHRINK

* test opencl, commit to removing the crap conv code from GPU

* testopencl

* fixup tests

* opencl not imported

* we need that opt to make gpu decent speed

* w/e, that's a later prob

* fix bug caused by rounding

* fix opencl bug, no training on opencl

* more crap to remove without convs

* join expands

* default opt level 2

* prune graph

* don't shuffle if there's children involved

* disable opencl tests

* tests maybe

* output file to disk

* fix row pitch

* inputs and outputs

* outputs with size

* buffer_id is 8 bytes

* Fold reduce (tinygrad#362)

* folding reduce

* fold through movementops

* fixup shapes

* was too aggressive

* i knew we needed that

* don't recompute reduce

* working

* fix openpilot compile

* prunegraph openpilot

* types and reduce_shape

* refactor

* cleanups

* neater

* 1009

* 1004

* clean up reduce for 998

* touchups

* adam in benchmark_train_efficientnet

* REQUIRES_SIMPLE_REDUCE

* save weights

* zero out the buffer

* needs_load in image correct

* float16 fixups

* that should be right

* fix options on old pyopencl

* fix that soon

* remove useless init, add ops counter

* fix op estimate

* add gflop estimate

* add time sum

* notes

* fix ane on new mac os x

* amfi note

* docs

* notes

* update readme

* broken amfi patch

* ane: procPath issue. don't waste more time with this, focus on core tinygrad

* rawcpu (tinygrad#365)

* rawcpu

* add should work when we respect shapetracker

* now that's true

* still have to handle shapetracker

* copyin

* Fix mypy

* tinygrad.nn (tinygrad#367)

* tinygrad.nn

* flake8

* working on pylint

* more pylint

* more pylint

* pylint passes

* networkx

* mypy can't infer that type

* junk

* fixup run thneed

* run_onnx_torch

* reduce axis at the end

* much simpler reduce

* hmm, with the new reduce, we have to opt 3 for memory usage

* maybe that's a better way to do this

* less needless reshaping

* 2 stage reduce

* tune inter_red

* t.assign in optim

* add openpilot tests to tinygrad

* enable the openpilot test

* fix cpu thneed running

* use functools.partialmethod (tinygrad#369)

Co-authored-by: Kyle <kposborne@gmail.com>

* run_thneed with test

* fix test maybe

* opencl can't optimize that

* refactor getters

* remove from_image

* image input works

* fix typing

* native_exp is way faster on qcom

* hmm, the native exp/log breaks it too much

* float32 in image desc

* thneed run float32

* oops, compare with abs

* flip that

* print inputs

* no torch test if no torch

* add reciprocal

* still broken

* line count

* Rewrote Tensor.cat to be shorter and (hopefully) clearer (tinygrad#372)

* Rewrote Tensor.cat to be shorter and (hopefully) clearer

* Use cumsum[-1] instead of separate sum

* typos

* fix cl import error

* fix wrong size input

* TEST_ENET for openpilot compiler

* fix batchnorm folding in openpilot compile

* don't save input buffers

* save free 200ms

* stable diffusion start

* fix tests hopefully, more stable diffusion

* stable_diffusion: add attn and layernorm

* torch bs

* found tinygrad bug

* yolo

* fix check

* stable diffusion works

* remove ugly parens

* cleanups for Mid

* easier to read

* more readable actually

* one liner that's more clear

* from_number_like to fix div issue

* better idea for numbers, do the division in python

* work

* stable diffusion compiles (add no_init)

* runs on torch cpu

* Make creation helpers use fp32 by default (tinygrad#374)

* Make creation helpers use fp32 by default

half the big = twice the fast

* Fix flake8 with an extra multiply

* clip model is running

* fix transformer bugs

* fix last bug in unet probz

* all models match

* brown img

* it renders something

* cat horse winning ❗

* other prompt example

* better alphas

* stable diffusion cleanups

* stable diffusion in readme

* improve opencl, why is it OOMing

* bring back native exp log

* works at work

* 1100 lines, but sane linter rules

* fix stupid OPENCL=1 OOM

* broadcast from right to left (tinygrad#375)

* broadcast from right to left

* add another broadcasted add test

* fix sd with TORCH=1

* hmm, need this with broadcast change

* simpler movement op

* add div to operators

* fix slice one multi, and linear can be simpler with new broadcasting

* make gpu code readable

* cpu line savings and cleaner

* have to ignore that type

* add Linear to tinygrad.nn

* relax mnist test a tiny bit

* support more onnx ops (tinygrad#376)

* broadcast from right to left

* add another broadcasted add test

* more onnx ops

* use float32 range in clip

* change default opt to 2

* Revert "change default opt to 2"

This reverts commit 726f4e9.

* update serious_mnist.py (tinygrad#380)

* Added standalone CLIP tokenizer (tinygrad#382)

* Added standalone CLIP tokenizer.

* Fixed empty phrase.

* Truncating long prompts.

* Keeping two slots for the start and end token.

* Fixed empty phrase.

* Using tokenizer for empty phrase.

* Typo.

* cleanup clip tokenizer

* forgot a few

* test_matmul

* simple on device failing test

* fix test failure on MATMUL=1 backward pass

* fix matmul kernel and tests

* add barrier

* support float16 onnx weights (tinygrad#384)

* add min support

* that's simpler

* import tests from CL metal texture fix

* set requires_grad to None (tinygrad#387)

* set requires_grad to None

* some things need gradients

* hmm, why was get_parameters filtering

* clipnorm support

* Reshape dataset from fetch_mnist (tinygrad#390)

* fix mnist load from other dirs

* move get_parameters to optim.py

* Fix weight init: this work? (tinygrad#391)

* this work?

* glorot uniform

* requires_grad broke

* propagate the None correctly

* so this weight init works

* ahh, i think it's this

* can't beat this

* glorot is best for ae

* remove comments

* layernorm is all axis but the first

* revert layernorm to have axis param

* fix efficientnet

* fix bn folding issue, add new test

* fix tests

* Device.GPU isn't defined

* ugh, global state

* should this be 10?

* notrain test

* external_test_opt

* Fix OpenCL Metal texture issues (tinygrad#378)

* Fix OpenCL Metal texture issues

Tile CL images when needed, to fit into the 16384 max Metal image size;
gets me to ~4.8s/iteration for SD on M1 Pro with OPENCL=1 FLOAT16=1.

* Minor cleanup

* Fix mish in CI, or no-op?

* Is mish being framed?

* It would help if any of this reproduced locally

* ???

* OPT is reverted; use original mish

* Cleanup post-review

* Fix some shape usage

* Tiler tests, shouldn't oom or overflow either

* Can't CL if there's no CL?

* Run tiler tests even if GPU=1

* relu6 segfault binary chop; revert test

* relu6 segfault binary chop; revert accel

* relu6 segfault binary chop; revert . (???)

* end relu6 segfault binary chop; repo's haunted

* some args for stable diffusion

* test_sd_big_conv

* always MATMUL, test the ops in OPENCL

* ugh, why did that fail

* Fix GPU 2**31 virtual size limit (tinygrad#392)

* in progress

* big conv test works

* that's unneeded

* fix opencl with reduce

* rewrite contiguous_view_constant_fold

* clean up mids in loop code

* subidx

* print cl kernel before run

* no reduce, no loop

* Revert "no reduce, no loop"

This reverts commit 92777e4.

* measure speed vs torch

* touchup

* remove redundant list comprehension from inside all. (tinygrad#397)

remove explicit inherit from object.

* enable tests in test_ops.py that are disabled but now work. (tinygrad#396)

remove custom tolerances that don't appear to be needed.

* openpilot: new models and onnx ops (tinygrad#401)

* ngrl stuff

* fngrl

* fix typo in compile script

* workflow dispatch

* new models in tests

* dont need to up this threshold

Co-authored-by: HaraldSchafer <harald.the.engineer@gmail.com>

* fix openpilot test

* refactoring thneed (tinygrad#400)

* refactoring thneed

* continue

* minor update

* looks like it's working

* big refactor

* confirm thneed got the right output

* code is there but it's broken

* works now

* always OPTWG, input -> dat

* fix type issue

* ReduceSum

* fix thneed self test

* read input shapes and break down the layers

* rerun

* zero out the inputs

* remove useless buffer

* add assert to catch issue in attention

* safe_numpy and warning for broken matmul

* add CONTIGUOUS loadop

* don't recopy backing

* might fix tests

* raise, don't assert

* fix nonstatic weights

* really dumb bug

* remove run_thneed dead code

* replace networkx with defaultdict

* move ops.py into lazy.py (tinygrad#402)

* move ops.py into lazy.py

* fix graph and linter

* ugh, didn't add

* relu simpler backward pass

* more imports from llvm branch

* LLVM Backend take 2 (tinygrad#403)

* take 2 llvm

* get_lazybuffers -> get_buffers

* llvm tests pass

* fix type issues and refactor LLVM

* Exec AST (tinygrad#404)

* working exec ast

* exec_ast is staticmethod

* GenericExecAST

* fold that sometimes

* ExplicitExecAST

* exec_ast for GPU

* gpu working

* get_lazyop_shape

* now gpubuffer is ExplicitExecAST

* dedup

* add a type

* RESHAPE in opencl code

* fix linter

* that too for linter

* cleanups

* remove dead code

* GenericShape is less lines

* add ALLOWED_KERNEL_COUNT to tests

* fix mypy

* that's gotta be recursive

* fix opencl shape processing

* remove unneeded lambda

* cleanups, remove E701

* can we lose the lines with E701 still there?

* lazy cleanups

* move into graph.py

* fix flake8

* fix graph in openpilot/compile.py

* hasattr and DeviceBuffer type fixups

* clean up movement_op in cpu and torch

* very minor

* test speed w/o bias

* more test opt

* no RESHAPEs in the AST

* MovementOps is unused

* one more opt test

* accurate flop estimation

* llvm doesn't vectorize

* vectorization

* gemm is 1.7 TFLOPS on a single M1 core

* more amx notes

* oops, remove while(1)

* separate STRIDED and EXPAND

* fix llvm vectorization by add analysis passes from the target machine

* that was in there twice, DEBUG>=4 to see loop opt

* rewrite some strideds into reshapes

* fix bug in ops test, it was cheating somehow

* stop blowing up floats

* comments and readability in lazy.py

* fix type error

* 1s are always mergable

* Gemm (tinygrad#416)

* gemm

* off by factor of 5

* 50 GFLOPS

* works

* 91 gflops

* working at 50G

* works

* iy

* 150 GFLOPS

* 150 GFLOPS

* N=2048 is still fast

* threading soon

* multithread

* pinning

* throttling is sad

* Align matrices to cacheline width (tinygrad#361)

Co-authored-by: cloud <Cloud11665@gmail.com>

* updates from the chonker branch

* fix termcolor import

* ugh, that too

* rename test functions to helper_

* bump version to 0.4.0

* Create python-publish.yml (tinygrad#163)

* Fix issue where batch_invstd not being set (tinygrad#421)

batch_invstd can be falsely assumed to be set even when it is None, since hasattr does not return False in that case. BatchNorm2D then attempts a reshape on it, which raises an exception.
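A minimal illustration of the pitfall (hypothetical class, not the tinygrad code):

```python
# hasattr() is True for attributes that exist but are None, so it cannot
# be used to test whether batch_invstd has actually been computed.
class BN:
    def __init__(self):
        self.batch_invstd = None    # allocated but not yet set

bn = BN()
print(hasattr(bn, "batch_invstd"))              # True -- misleading
print(getattr(bn, "batch_invstd") is not None)  # False -- the check that works
```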

* Basic editorconfig support (tinygrad#422)

Almost every IDE or text editor supports
[editorconfig](https://editorconfig.org/).
I've set it up to just enforce the two-space Python indents for now.
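For reference, a minimal `.editorconfig` enforcing only that rule might look like this (a sketch; the file actually committed may differ):

```ini
# sketch of a minimal .editorconfig for 2-space Python indents
root = true

[*.py]
indent_style = space
indent_size = 2
```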

* contributing

* more that

* contrib more

* Reduce line count (tinygrad#424)

* save a line, save a life

* save a line, save a life

* change order of tern

* factorizing shapetracker from chonker

* contiguous, and no strided for matmul

* Simple chonker (tinygrad#431)

* chonker will make llvm fast

* work

* better speed tests, we will make them fast

* with the cache add is the same speed

* relu and neg are fast

* fix sum speed

* maximum maxnum?

* hack for gemm opt

* gemm very slow

* zeros like

* test_permute

* shapetracker returns self

* fix shapetracker factorization

* err, int strides

* permutes are faster now in tinygrad than pytorch

* support -1 in expand

* gemm unrolled

* improve final test case

* WIP GEMM

* why isn't GEMM fast?

* revert cache dim

* ffp contract works on clang, not llvm?

* ignore llvm ir

* this makes fma work at least, but no faster

* USE_4x4

* 63 GFLOPS

* 87 GFLOPS

* that wasn't matmul, 44 GFLOPS now

* 82 GFLOPS permuted

* this permute too

* a little speed for the convs

* 45 GFLOPS

* speed tests pass again

* clean up prints

* fix FMA WHAT A WASTE OF TIME

* colors

* moar fair

* GPU

* useless on chonker

* cleanups

* improve factorized shapetracker

* better threshold

* label conv

* work

* ops test pass again

* hot load the index

* run the last view, no need to create

* ZeroView needs a repr for the key to work

* fix segfault on out of bounds

* one more test

* start amx, and llvm.initialize_native_asmparser

* amx works

* nice AMX class

* nicer AMX class

* refactor get_idxs

* amx working

* is slower...

* useless flip

* cache

* SZ_X

* AMX_SZ_X/Y work alone

* Contiguous mlop

* test gemm packed

* PREPARE in packed

* use_amx factor

* prefetch isn't faster

* loop

* same 3ms

* 2.24 ms

* allow double on store in TG

* amx reduce is the same speed as non amx reduce

* include memory bandwidth

* clean up shapetracker

* flip returns stride

* prepare for upstream

* Update ops_llvm.py (tinygrad#426)

* permutes are yellow and green now

* faster conv

* llvm cleanups

* Show optimised IR under debug 4 (tinygrad#428)

* ASTKernel class

* Make tinygrad work with older python version (tinygrad#427)

* Make tinygrad work with older python version

* Use partialmethod instead of partial

* simple chonker is chonking

* remove junk from test speed vs torch

* fix linker and types

* AMX is only here now

* add LLVM tests, it's a valid backend now

* oops, run llvm test

* contiguous_op

* fix loadops compare

* dedup reduceops

Co-authored-by: calledit <1573053+calledit@users.noreply.github.com>

* s/contiguous_op/contiguous

* the speedy chonker is going to replace the old chonker (tinygrad#432)

* bringing back reshape and permute

* done with E701

* 4x4 works in generic way

* max and sum not vectorizing...

* special case single float

* support comparing to MPS

* improve matmul speed, consider generic principles

* GlobalCounter

* fix op tracking

* faster

* comment that out for now

* err, it needs that

* fix minor issues

* fix global_mem

Co-authored-by: George Hotz <geohot@gmail.com>
Co-authored-by: Comma Device <device@comma.ai>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: George Hotz <george@comma.ai>
Co-authored-by: kposborne2 <53231580+kposborne2@users.noreply.github.com>
Co-authored-by: Kyle <kposborne@gmail.com>
Co-authored-by: Mitchell Goff <mitchellgoffpc@gmail.com>
Co-authored-by: Ollin Boer Bohan <madebyollin@gmail.com>
Co-authored-by: YassineYousfi <yyousfi1@binghamton.edu>
Co-authored-by: David Redmon <85855920+redmonmd@users.noreply.github.com>
Co-authored-by: Fernand Pajot <accounts@epigram.me>
Co-authored-by: Jacky Lee <39754370+jla524@users.noreply.github.com>
Co-authored-by: Drew Hintz <dhintz@squareup.com>
Co-authored-by: HaraldSchafer <harald.the.engineer@gmail.com>
Co-authored-by: cloud <Cloud11665@gmail.com>
Co-authored-by: Liam <3579535@myuwc.ac.za>
Co-authored-by: marcojob <44396071+marcojob@users.noreply.github.com>
Co-authored-by: Daniel Davis <dan@dandavis.dev>
Co-authored-by: calledit <1573053+calledit@users.noreply.github.com>